Using the R Programming Environment

Introduction

In recent years a programming environment named R has become more and more popular. It grew out of a language named "S," which was first developed at Bell Labs in the 1970s; the version described by Becker, Chambers, and Wilks appeared in the late 1980s. Along the way the companion programming environment named R was introduced, along with a commercial product named S-PLUS. There are some minor differences among the three, but nothing major. R appears to have made S obsolete, and I don't think that you can even find S anymore. We will concentrate on R because it is freely available, with a huge collection of functions that other people have added to it and are continuing to add. I don't know the formal distinction between a language and a programming environment. People would call Fortran a language, but they call R a programming environment. Similarly, given my (old) background, I would call what we write a "program," whereas others would call it a code snippet or a command file. I'll stick with "program" or "code" even if the former is out-of-date.

It is important to recognize that R came out of the Unix operating system environment, which explains some of the features you will come across. For example, Unix (and Linux) are case-sensitive, and so is R, so "Print" and "print" are two different commands, and "Print" doesn't exist and will give an error message. To those of us who grew up with early programming languages or with Windows and the Mac, Unix takes a bit of getting used to. If someone tries to get you to write the command 'print("Hello World")' or refers to a dummy file name as "foo" or "foobar," it's a good bet you are up against a Unix creature. (They all love instructing you to write out "Hello world" as your first step. Can't we be more creative?)
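To see the case sensitivity for yourself, here is a minimal sketch. It uses only the built-in exists() function, and it assumes a fresh R session in which nobody has created an object named "Print." The variable names "score" and "Score" are made up for the illustration.

```r
# R treats upper and lower case as different, just as Unix does
exists("print")   # the built-in function, spelled in lower case
exists("Print")   # no such object in a fresh R session

# A variable named "score" and one named "Score" are two separate objects
score <- 10
Score <- 20
score == Score
```

The first call to exists() should report TRUE and the second FALSE, and the final comparison shows that the two variables hold different values.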

Whereas most people generally write a program and then execute it, Unix types frequently like to work with what is called "the Command Line." This means that you type a command and it is executed, then you type the next command and it is executed, etc. We will do some of that, but it takes some getting used to. We will generally combine commands into a "program" and execute some or all of that at once. Finally, Unix creatures have a command called "man" which prints out help ("manual") pages. So if you don't know how to change your working directory you type "man cd" and it will tell you. (Of course that assumes that you know that the name of the command is "cd," but doesn't everyone know that?) R uses the same kind of help system, although the command is "help(setwd)" or, equivalently, "?setwd". That's great because help is always available, but it's bad because the help pages are not always as clear as you would like. In fact, some of them make no sense to me.
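Here is a short sketch of the help commands in use. The first two lines are the ones mentioned above. The other two, help.search() and apropos(), are standard R functions that I am adding on the assumption that you will sometimes not know the name of the command you are looking for.

```r
# Two equivalent ways to open the help page for setwd()
help(setwd)
?setwd

# When you do not know the function name, search the help pages by keyword
help.search("working directory")

# List the functions whose names contain a given string
apropos("wd")
```

The last line lists, among others, setwd() and getwd(), which is one way to discover the name of a command you only half remember.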

These pages will not make you an accomplished R programmer. I hope that they will at least make you sort of a half-assed programmer. If they do, there are lots of books that will help you take it from there. My intent is to show you how to read in data, how to transform them if necessary, and how to use them to perform statistical calculations. Although R is not a statistical language, its greatest development has been in the fields of statistics, about which I know a reasonable amount, and bioinformatics, about which I know less than nothing. There is almost nothing in statistics that you can't do in R, and if you want to do something even slightly complicated, such as computing a logistic regression, someone has already been there ahead of you and written functions to do that. You just have to call the function and give it the right information.
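As a rough illustration of "calling the function and giving it the right information," here is a minimal sketch of a logistic regression. It uses glm(), which comes with R, and the mtcars data set that ships with R. The choice of variables (predicting a manual transmission from the car's weight) is only my assumption for the sake of an example, not anything from the textbook data.

```r
# mtcars ships with R; am is coded 0 (automatic) or 1 (manual),
# and wt is the weight of the car in thousands of pounds
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients, standard errors, and significance tests
summary(fit)

# Predicted probability of a manual transmission for a 3,000 lb car
predict(fit, newdata = data.frame(wt = 3), type = "response")
```

All of the work is done by the one call to glm(). You supply the model formula, the data, and the family, and R does the rest.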

Good texts to use

If you want a decent text for R, and I hope you do, there are several that I can recommend. Perhaps the best book that I know is Andy Field's Discovering Statistics Using R. It is a mammoth book with 957 pages and weighs more than almost any book you have, but Andy writes in a very entertaining way and you can learn a lot. Everitt and Hothorn (2006), A Handbook of Statistical Analyses Using R, is a good gentle introduction. Perhaps the best of the straight R books is Crawley (2007), The R Book. And then there is R in Action by Robert Kabacoff. It does a nice job of threading the way between learning statistics and learning R. That is probably my second choice of the best R book. But first of all, look for tutorials on the Web. There are lots of them and some are quite good. Just ask Google. One that I would recommend is called "The R Guide" and can be downloaded from http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf. This guide covers quite a bit of material in a very comprehensible way. But one that I recently discovered and like even better is by Kelly Black at http://www.cyclismo.org/tutorial/R/. Another very good set of web pages is by William King at Coastal Carolina University.

I mentioned that R has a whole set of help screens. I also said that they are not always clear. I have recently discovered that if I want to use the ifelse() command, I am probably better off going to Google and searching for "ifelse R" than I am with ?ifelse.

First you need some data

Well, that isn't completely true, but data will help. Most of my examples that use data involve data files from Statistical Methods for Psychology, 9th ed. (There is a similar source for data from the 9th edition of Fundamental Statistics for the Behavioral Sciences, which you might want to search if you cannot find the file you want.) You can download a compressed ("zipped") file containing the data for the 9th edition of the "methods" book from https://www.uvm.edu/~dhowell/methods9/DataFilesASCII.zip . Depending on your machine, you will be asked if you want to open the file or save it. I suggest that you save it to a place that will store your material for this course, such as a folder called "c:/Stat Methods." You can then click on that zipped file and it will create txt files of all of the data sets in the book. If I were really organized, that's what I would do. But unfortunately I'm not really organized. (I have copies of files all over the place, which is a dumb idea.) Having the data in one place makes it much easier to get at the files when you need them.
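If you would rather let R do the downloading and unzipping, something like the following sketch should work. The function name fetchDataFiles is my own invention for this illustration, and the sketch assumes that the address given above is still valid and that you have a network connection. It relies only on standard R functions: dir.create(), download.file(), unzip(), and list.files().

```r
# Hypothetical helper: download the zipped data files and unpack them
fetchDataFiles <- function(destDir,
                           zipUrl = "https://www.uvm.edu/~dhowell/methods9/DataFilesASCII.zip") {
  # Create the folder if it does not exist yet
  dir.create(destDir, recursive = TRUE, showWarnings = FALSE)
  zipPath <- file.path(destDir, "DataFilesASCII.zip")
  # mode = "wb" keeps the zip file intact on Windows
  download.file(zipUrl, destfile = zipPath, mode = "wb")
  # Unpack everything into the same folder and return the file names
  unzip(zipPath, exdir = destDir)
  list.files(destDir)
}

# Needs a network connection, so the call is left commented out:
# fetchDataFiles("c:/Stat Methods")
```

Remove the "#" from the last line, and change the folder name to suit your own machine, when you are ready to run it.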

Having downloaded the files to your own machine, you can start each set of code by specifying the default directory. RStudio has a dropdown menu called "Session" which will allow you to do that very easily. It will enter a line of code that looks similar to 'setwd("~/Dropbox/Webs/methods9/DataFiles")'. Now all you have to do is supply the file name, as in 'dat <- read.table("myFile.txt", header = TRUE)'. That saves a lot of typing. But that only works if the "setwd()" points to a folder on your computer. You have to get a bit sneakier if you want it to point to a folder on my server at the University of Vermont.
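Here is a small sketch of that sequence that you can run without having downloaded anything. Because I cannot know what folders exist on your machine, it builds a stand-in data folder inside R's temporary directory and writes a tiny made-up file to it. The folder, the file contents, and the variable names are all assumptions for the illustration.

```r
# Stand-in for your data folder: a temporary folder with one small file in it
dataDir <- file.path(tempdir(), "DataFiles")
dir.create(dataDir, showWarnings = FALSE)
writeLines(c("id score", "1 12", "2 15", "3 9"),
           file.path(dataDir, "myFile.txt"))

# Set the default directory once, remembering the old one
oldDir <- setwd(dataDir)

# From now on the file name alone is enough
dat <- read.table("myFile.txt", header = TRUE)
dat

# Go back to where you started
setwd(oldDir)
```

With your own files you would skip the first few lines and simply point setwd() at the folder where you unzipped the data.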

There is another alternative, which involves calling up the data directly from my server. For example, if you are working on Exercise 15.6, you could enter the following command at the beginning of your code: 'Data <- read.table("https://www.uvm.edu/~dhowell/methods9/DataFiles/Ex15-6.dat", header = TRUE)'. That is the way I will often show the code in the text. But it requires a bit too much typing. However, there is a shortcut in R. It may look awkward, but it saves a lot of typing. Assume that you want to load a data file named "survrate.dat." At the beginning of each set of code, enter

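     rm(list = ls())                                            # clear out old variables
     url <- "https://www.uvm.edu/~dhowell/methods9/DataFiles/"  # base address on the server
     fileName <- paste(url, "survrate.dat", sep = '')           # add the name of the file
     data3 <- read.table(fileName, header = TRUE)               # read the data
     names(data3)                                               # list the variable names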

Those commands will first clear out any old variables that you have accumulated, just to be safe. Then they will create a variable that is the basic address of the data files on the server. They will next take that address and add the name of the file that you are going to want to use. Finally, they will read the data and list the names of the variables. I will try to ensure that any code I present in the book (at least on web pages) will have those five lines. All you will have to do is supply the name of the file itself. (However, sometimes I will simply set the default directory on my own machine and let you figure out what directory you want on your machine.)
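If you find yourself reading several files in one session, the address-building step can be tucked into a small function. This is only a sketch. The name dataAddress is hypothetical, and paste0() is simply the standard shorthand for paste() with sep = ''.

```r
# Base address of the data files on the server (taken from the text above)
baseUrl <- "https://www.uvm.edu/~dhowell/methods9/DataFiles/"

# Hypothetical helper: build the full address for one data file
dataAddress <- function(fileName, base = baseUrl) {
  paste0(base, fileName)
}

# Show the address that would be used for survrate.dat
dataAddress("survrate.dat")

# Reading the file needs a network connection, so it is left commented out:
# data3 <- read.table(dataAddress("survrate.dat"), header = TRUE)
```

The design choice here is just to keep the long address in one place, so that the only thing you ever change is the name of the file.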

An Outline of these Pages

I am going to split these pages into several different units just so that no unit becomes too long. You can always click your way there from here. I will begin with a page on downloading R and related files. As I said elsewhere, if you can install iTunes you can install R. But along with R it is helpful, but not required, to have a good editor. I will discuss a couple of those in that section, and ultimately recommend RStudio.

Next I will examine a simple example in which you enter some commands, set up some data, and run an analysis. Because this is the beginning, and many people will be using these pages alongside an ongoing statistics course, the first few examples will involve fairly elementary statistics. In this section I am not going to say much about the specific commands we will use. I just want you to see what can be done.

In the following section I will lay out the basic information about reading in data, creating new variables, doing some simple calculations, and printing out results. This section will mainly focus on data manipulation, which R is very good at. I cannot possibly burden you with everything that R will do, but we will cover the basics.

One of the things that R does best is graphics. We will have a whole section devoted to creating meaningful graphs. My goal is to give you annotated code so that you can later steal that code, change the variable names and the text, and produce the same kinds of graphs. Personally I find it easiest to learn by looking at what someone else did and then adapting it to my needs. That is what this section will attempt to do.

Specific Topics

More Stuff to follow
dch: